A Florida health insurance company wants to predict annual claims for individual clients, so it pulls a random sample of 100 customers. The owner wishes to charge an actuarially fair premium that ensures a normal rate of return. To that end, the owner collects all current customers' health care expenses from the last year and compares them with what is known about each customer's plan.

The data on the 100 customers in the sample is as follows:

  • Charges: Total medical expenses for a particular insurance plan (in dollars)
  • Age: Age of the primary beneficiary
  • BMI: Primary beneficiary’s body mass index (kg/m²)
  • Female: Primary beneficiary’s birth sex (0 = Male, 1 = Female)
  • Children: Number of children covered by health insurance plan (includes other dependents as well)
  • Smoker: Indicator if primary beneficiary is a smoker (0 = non-smoker, 1 = smoker)
  • Cities: Dummy variables for each city with the default being Sanford

Answer the following questions using complete sentences and attach all output, plots, etc. within this report.

# Load the packages used throughout this report
library(dplyr)      # pipes (%>%) and data manipulation
library(gtsummary)  # tbl_summary()
library(car)        # scatterplotMatrix()

insurance <- read.csv("../CodingAssignment03/Insurance_Data_Group9.csv")

Question 1

Randomly select 30 observations from the sample and exclude them from all modeling. Provide the summary statistics (min, max, standard deviation, mean, median) of the quantitative variables for the remaining 70 observations.

set.seed(123457)
exclude <- sample(nrow(insurance), 30)
train <- insurance[-exclude, ]
test <- insurance[exclude, ]

train %>%
  tbl_summary(statistic = list(all_continuous() ~ c("{mean}",       # Mean
                                                    "{sd}",         # Standard Deviation
                                                    "{median}",     # Median
                                                    "{min}",        # Minimum
                                                    "{max}"         # Maximum
                                                    )
                               ),
    type = all_continuous() ~ "continuous2" # Enhanced formatting for continuous variables
  )
Characteristic N = 70¹
Charges
    Mean 13,375
    SD 12,237
    Median 9,570
    Min 1,136
    Max 51,195
Age
    Mean 41
    SD 14
    Median 44
    Min 18
    Max 64
BMI
    Mean 31.1
    SD 5.7
    Median 30.8
    Min 16.0
    Max 47.7
Female 28 (40%)
Children
    0 27 (39%)
    1 15 (21%)
    2 16 (23%)
    3 9 (13%)
    4 2 (2.9%)
    5 1 (1.4%)
Smoker 16 (23%)
WinterSprings 17 (24%)
WinterPark 23 (33%)
Oviedo 14 (20%)
¹ n (%)
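As a cross-check on the tbl_summary output, the same five statistics can be computed directly in base R; a minimal sketch using the train data frame defined above:

```r
# Cross-check of the summary table using base R (no extra packages)
num_vars <- c("Charges", "Age", "BMI")
stats <- sapply(train[num_vars], function(x)
  c(Mean = mean(x), SD = sd(x), Median = median(x),
    Min = min(x), Max = max(x)))
round(stats, 1)
```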

Question 2

Provide the correlation matrix for all quantitative variables.

cor(train[, c("Charges", "Age", "BMI", "Children")])
##            Charges       Age       BMI  Children
## Charges  1.0000000 0.2669529 0.2394231 0.2497586
## Age      0.2669529 1.0000000 0.2372201 0.2987257
## BMI      0.2394231 0.2372201 1.0000000 0.1781634
## Children 0.2497586 0.2987257 0.1781634 1.0000000
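The same relationships can be inspected visually; a minimal base-R companion to the correlation matrix, using the same four variables:

```r
# Scatterplot matrix of the quantitative variables (base R)
pairs(train[, c("Charges", "Age", "BMI", "Children")],
      main = "Pairwise relationships in the training sample")
```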

Question 3

Run a regression that includes all independent variables in the data table. Does the model violate any of the Gauss-Markov assumptions? If so, which ones, and what is the solution for correcting them?

allvar <- lm(Charges ~ Age + BMI + Female + Children + Smoker + WinterPark + WinterSprings + Oviedo, data = train)
summary(allvar)
## 
## Call:
## lm(formula = Charges ~ Age + BMI + Female + Children + Smoker + 
##     WinterPark + WinterSprings + Oviedo, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11292.8  -2542.3     18.9   3010.9  18364.1 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -15510.23    3872.49  -4.005 0.000171 ***
## Age              236.62      47.33   4.999 5.15e-06 ***
## BMI              457.04     117.63   3.886 0.000254 ***
## Female         -1722.18    1286.24  -1.339 0.185562    
## Children        1401.17     546.30   2.565 0.012799 *  
## Smoker         23987.60    1550.21  15.474  < 2e-16 ***
## WinterPark     -3948.67    1775.32  -2.224 0.029846 *  
## WinterSprings   -382.03    1950.68  -0.196 0.845382    
## Oviedo          -309.50    1913.52  -0.162 0.872042    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5074 on 61 degrees of freedom
## Multiple R-squared:  0.848,  Adjusted R-squared:  0.8281 
## F-statistic: 42.54 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) # show the four diagnostic plots in one 2x2 grid
plot(allvar)

The residual plots indicate heteroskedasticity, a violation of the Gauss-Markov assumption of constant error variance. Transforming Charges with the natural log reduces the heteroskedasticity and pulls the extreme observations closer together. After the transformation, the histogram shows that log(Charges) is approximately normally distributed, so log(Charges) is a better representation of this variable.
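A formal test can back up what the residual plots suggest. Assuming the lmtest package is installed, a Breusch-Pagan test on the full model is a quick check for non-constant error variance:

```r
# Formal heteroskedasticity check (Breusch-Pagan test) on the full model;
# assumes the lmtest package is available
library(lmtest)
bptest(allvar)  # a small p-value points to non-constant error variance
```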

Question 4

Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?

hist(train$Charges) #before

train$lnCharges <- log(train$Charges) 
hist(train$lnCharges) #after

hist(train$Charges) #before

train$ChargesSquared <- train$Charges^2 
hist(train$ChargesSquared) #after

scatterplotMatrix(train[c(10,2:3,5)]) # grabbing ln charges

par(mfrow=c(1,2)) # place the charts side by side
train$ChildrenSquared <- train$Children^2
hist(train$ChildrenSquared) #after

train$lnChildren <- log(train$Children) # note: log(0) is -Inf for customers with no children
hist(train$lnChildren) #after

scatterplotMatrix(train[c(10,2:3,12)]) # grabbing lnCharges with lnChildren

par(mfrow=c(1,2)) #  Input to place the charts side by side
hist(train$Age) #before

train$AgeSquared <- train$Age^2 
hist(train$AgeSquared) #after

hist(train$Age) #before

train$lnAge <- log(train$Age) 
hist(train$lnAge) #after

hist(train$BMI) #before

train$BMISquared <- train$BMI^2 
hist(train$BMISquared) #after

hist(train$BMI) #before

train$lnBMI <- log(train$BMI) 
hist(train$lnBMI) #after

Transforming the Children variable into log or quadratic form does not produce a normal distribution (and the log is undefined for the many customers with zero children). Therefore, neither transformation suits the Children variable.

#Model 1

model_1 <- lm(lnCharges ~., data = train[,c(10,2:9)] )
summary(model_1)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:9)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7258 -0.1434  0.0042  0.1503  1.4266 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.572798   0.255273  25.748   <2e-16 ***
## Age            0.035493   0.003120  11.376   <2e-16 ***
## BMI            0.020083   0.007754   2.590   0.0120 *  
## Female         0.121214   0.084788   1.430   0.1579    
## Children       0.088441   0.036012   2.456   0.0169 *  
## Smoker         1.656425   0.102189  16.209   <2e-16 ***
## WinterSprings  0.007457   0.128588   0.058   0.9539    
## WinterPark    -0.126200   0.117029  -1.078   0.2851    
## Oviedo        -0.028646   0.126139  -0.227   0.8211    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3345 on 61 degrees of freedom
## Multiple R-squared:  0.8794, Adjusted R-squared:  0.8636 
## F-statistic:  55.6 on 8 and 61 DF,  p-value: < 2.2e-16
# this model's only change is the log of Charges; all independent variables remain the same.
par(mfrow=c(2,2))
plot(model_1)

#Model 2
model_2 <- lm(lnCharges ~., data = train[,c(10,3:9,15)] ) #pulling only columns I want
summary(model_2)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 3:9, 15)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55977 -0.14355  0.02305  0.10310  1.37177 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.188562   0.404219   7.888 6.85e-11 ***
## BMI            0.019762   0.007223   2.736  0.00814 ** 
## Female         0.109889   0.079076   1.390  0.16968    
## Children       0.072828   0.033830   2.153  0.03530 *  
## Smoker         1.631300   0.095059  17.161  < 2e-16 ***
## WinterSprings  0.002113   0.119861   0.018  0.98599    
## WinterPark    -0.130848   0.109018  -1.200  0.23468    
## Oviedo        -0.040374   0.117440  -0.344  0.73219    
## lnAge          1.339429   0.106512  12.575  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3118 on 61 degrees of freedom
## Multiple R-squared:  0.8952, Adjusted R-squared:  0.8815 
## F-statistic: 65.13 on 8 and 61 DF,  p-value: < 2.2e-16
#this model uses lncharges and lnAge
par(mfrow=c(2,2))
plot(model_2)

# Model 3
model_3 <- lm(lnCharges ~., data = train[,c(10,14,3:9)] ) #pulling only columns I want
summary(model_3)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 14, 3:9)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87652 -0.16920  0.02645  0.16491  1.41726 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.189e+00  2.696e-01  26.663  < 2e-16 ***
## AgeSquared     4.115e-04  4.172e-05   9.862 3.02e-14 ***
## BMI            2.127e-02  8.500e-03   2.503  0.01501 *  
## Female         1.304e-01  9.298e-02   1.402  0.16598    
## Children       1.065e-01  3.919e-02   2.718  0.00854 ** 
## Smoker         1.663e+00  1.122e-01  14.823  < 2e-16 ***
## WinterSprings  6.428e-03  1.410e-01   0.046  0.96380    
## WinterPark    -1.267e-01  1.284e-01  -0.987  0.32778    
## Oviedo        -2.027e-02  1.386e-01  -0.146  0.88417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3669 on 61 degrees of freedom
## Multiple R-squared:  0.8549, Adjusted R-squared:  0.8359 
## F-statistic: 44.92 on 8 and 61 DF,  p-value: < 2.2e-16
#this model uses lncharges and AgeSquared
par(mfrow=c(2,2)) # 2x2 grid for the four diagnostic plots
plot(model_3)

#Model 4
model_4 <- lm(lnCharges ~., data = train[,c(10,2,4:9,17)] ) #pulling only columns I want
summary(model_4)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 17)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73920 -0.13134  0.00264  0.13730  1.41835 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.101250   0.784230   6.505 1.63e-08 ***
## Age            0.035289   0.003141  11.236  < 2e-16 ***
## Female         0.118779   0.084788   1.401   0.1663    
## Children       0.089731   0.035970   2.495   0.0153 *  
## Smoker         1.655747   0.102238  16.195  < 2e-16 ***
## WinterSprings  0.017437   0.128540   0.136   0.8925    
## WinterPark    -0.120176   0.116477  -1.032   0.3063    
## Oviedo        -0.029406   0.126237  -0.233   0.8166    
## lnBMI          0.613866   0.238249   2.577   0.0124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3346 on 61 degrees of freedom
## Multiple R-squared:  0.8793, Adjusted R-squared:  0.8634 
## F-statistic: 55.53 on 8 and 61 DF,  p-value: < 2.2e-16
#this model uses lncharges and lnBMI
par(mfrow=c(2,2)) # 2x2 grid for the four diagnostic plots
plot(model_4)

#Model 5
model_5 <- lm(lnCharges ~., data = train[,c(10,2,4:9,16)] ) #pulling only columns I want
summary(model_5)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 16)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70989 -0.16308  0.00825  0.14457  1.43884 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.886e+00  1.816e-01  37.926   <2e-16 ***
## Age            3.573e-02  3.106e-03  11.503   <2e-16 ***
## Female         1.227e-01  8.492e-02   1.445   0.1536    
## Children       8.775e-02  3.610e-02   2.431   0.0180 *  
## Smoker         1.656e+00  1.023e-01  16.188   <2e-16 ***
## WinterSprings  6.694e-05  1.289e-01   0.001   0.9996    
## WinterPark    -1.288e-01  1.176e-01  -1.096   0.2776    
## Oviedo        -2.688e-02  1.262e-01  -0.213   0.8321    
## BMISquared     3.049e-04  1.191e-04   2.561   0.0129 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3349 on 61 degrees of freedom
## Multiple R-squared:  0.8791, Adjusted R-squared:  0.8633 
## F-statistic: 55.46 on 8 and 61 DF,  p-value: < 2.2e-16
#this model uses lncharges and BMISquared
par(mfrow=c(2,2))
plot(model_5)

#Model 6
model_6 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,12,5)] ) #pulling only columns I want
summary(model_6)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 12, 
##     5)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70026 -0.14405 -0.00955  0.13481  1.37468 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.620186   0.259924  25.470  < 2e-16 ***
## Age              0.035077   0.003150  11.136 3.11e-16 ***
## BMI              0.018474   0.007929   2.330   0.0232 *  
## Female           0.132595   0.085614   1.549   0.1267    
## Smoker           1.650975   0.102378  16.126  < 2e-16 ***
## WinterSprings   -0.007636   0.129557  -0.059   0.9532    
## WinterPark      -0.137528   0.117643  -1.169   0.2470    
## Oviedo          -0.055094   0.129051  -0.427   0.6710    
## ChildrenSquared -0.024082   0.024630  -0.978   0.3321    
## Children         0.171712   0.092472   1.857   0.0682 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3346 on 60 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8635 
## F-statistic: 49.49 on 9 and 60 DF,  p-value: < 2.2e-16
#this model uses lncharges and ChildrenSquared
par(mfrow=c(2,2))
plot(model_6)

#make children a dummy variable
train$childdummy <- 0 
train$childdummy[train$Children > 0] <- 1


#Model 7
model_7 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,18)] ) #pulling only columns I want
summary(model_7)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 18)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65227 -0.13093 -0.01566  0.13835  1.37804 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.642231   0.251498  26.411  < 2e-16 ***
## Age            0.035427   0.003043  11.644  < 2e-16 ***
## BMI            0.016919   0.007792   2.171  0.03380 *  
## Female         0.133797   0.083224   1.608  0.11307    
## Smoker         1.663931   0.100444  16.566  < 2e-16 ***
## WinterSprings -0.023576   0.121822  -0.194  0.84719    
## WinterPark    -0.161530   0.112590  -1.435  0.15649    
## Oviedo        -0.068542   0.123236  -0.556  0.58012    
## childdummy     0.263374   0.090093   2.923  0.00485 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3284 on 61 degrees of freedom
## Multiple R-squared:  0.8838, Adjusted R-squared:  0.8685 
## F-statistic: 57.97 on 8 and 61 DF,  p-value: < 2.2e-16
#this model uses lncharges and dummy children 
scatterplotMatrix(train[c(10,2:3,18)]) # grabbing lnCharges

par(mfrow=c(2,2)) # place the diagnostic plots in a 2x2 grid
plot(model_7)


Question 5

Use the 30 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)

test$lnCharges <- log(test$Charges)
test$childdummy <- 0 
test$childdummy[test$Children > 0] <- 1


test$insurance_model_pred <- predict(allvar, newdata = test)

test$model_1_pred <- predict(model_1,newdata = test) %>% exp()

test$model_7_pred <- predict(model_7,newdata = test) %>% exp()

# Finding the error

test$error_bm <- test$insurance_model_pred - test$Charges

test$error_1 <- test$model_1_pred - test$Charges

test$error_7 <- test$model_7_pred - test$Charges
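The error columns above can be rolled up into standard out-of-sample fit measures. A sketch using the prediction errors just computed; the perf helper below is illustrative, not part of the assignment code:

```r
# Out-of-sample performance measures built from the error columns above
perf <- function(err) c(Bias = mean(err),
                        MAE  = mean(abs(err)),
                        RMSE = sqrt(mean(err^2)),
                        MAPE = mean(abs(err / test$Charges)))
rbind(Benchmark = perf(test$error_bm),
      Model_1   = perf(test$error_1),
      Model_7   = perf(test$error_7))
```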

Question 6

Provide interpretations of the coefficients; do the signs make sense? Perform marginal change analysis on the independent variables.
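For a log-linear model such as model_7, marginal changes are naturally read in percentage terms. A sketch of the two common conversions: the 100·b approximation for continuous variables, and the exact (e^b − 1)·100 for dummy variables:

```r
# In a log-linear model, a coefficient b on a continuous variable implies
# roughly a 100*b percent change in Charges per one-unit increase; for
# dummy variables the exact effect is (exp(b) - 1) * 100 percent.
b <- coef(model_7)
pct_approx <- 100 * b              # approximation, reasonable for small b
pct_exact  <- (exp(b) - 1) * 100   # exact, appropriate for dummies
round(cbind(pct_approx, pct_exact), 1)
```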


Question 7

An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.

Customer  Age  BMI  Female  Children  Smoker  City
       1   60   22       1         0       0  Oviedo
       2   40   30       0         1       0  Sanford
       3   25   25       0         0       1  Winter Park
       4   33   35       1         2       0  Winter Springs
       5   45   27       1         3       0  Oviedo
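Assuming model_7 (the log-linear model with the children dummy) is the better model from Question 5, its prediction intervals for the five clients can be sketched as follows. The newclients data frame is built from the rep's table using the dummy coding defined earlier, and the intervals must be exponentiated back to dollars because the model predicts log(Charges):

```r
# Sketch: prediction intervals for the five potential clients under model_7
newclients <- data.frame(
  Age           = c(60, 40, 25, 33, 45),
  BMI           = c(22, 30, 25, 35, 27),
  Female        = c(1, 0, 0, 1, 1),
  Smoker        = c(0, 0, 1, 0, 0),
  WinterSprings = c(0, 0, 0, 1, 0),  # Sanford is the default city
  WinterPark    = c(0, 0, 1, 0, 0),
  Oviedo        = c(1, 0, 0, 0, 1),
  childdummy    = c(0, 1, 0, 1, 1)   # 1 if the client has any children
)
# predict() returns fit/lwr/upr on the log scale; exp() converts to dollars
exp(predict(model_7, newdata = newclients, interval = "prediction"))
```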

Question 8

The owner notices that some of the predictions are wider than others, explain why.

Question 9

Are there any prediction problems that occur with the five potential clients? If so, explain.